Adaptive Coherence Batching for Trap-Based Memory Architectures
Authors
Abstract
Both software-initiated and hardware-initiated prefetching have been used to accelerate shared-memory server performance. While software-initiated prefetching requires instruction-set and compiler support, hardware prefetching often requires additional hardware structures or extra memory state. The coherence batching scheme proposed in this paper keeps the system completely binary transparent and does not rely on any additional hardware. Hence, it can be implemented in software-coherent systems without additional hardware and can improve performance for already optimized and compiled binaries. We have evaluated our proposals on a trap-based memory architecture where fine-grained coherence permission checks are done in hardware but the coherence protocol is run in software on the requesting processor. Functional full-system simulation shows that our software-only coherence batching scheme reduces the number of coherence misses by up to 60 percent compared to a system without coherence batching. The average miss reduction is 37 percent, and the average bandwidth usage is also reduced.
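To make the trap-based setting concrete, the following toy model counts how often a requesting node would trap into a software coherence handler for a given access stream. It is only an illustrative sketch, not the scheme evaluated in the paper: the name `count_traps`, the 64-byte `LINE_SIZE`, and the fixed `batch_lines` parameter are all assumptions made here, and the model ignores invalidations from other nodes, bandwidth cost, and any adaptive choice of batch size.

```python
from typing import Iterable, Set

LINE_SIZE = 64  # bytes per coherence unit (assumed value, not from the paper)


def count_traps(addresses: Iterable[int], batch_lines: int = 1) -> int:
    """Count coherence traps for one requesting node (hypothetical toy model).

    The membership test stands in for the hardware permission check.
    On a failed check the "software handler" acquires permission for the
    missing line and the next batch_lines - 1 consecutive lines at once.
    """
    valid: Set[int] = set()  # lines this node currently has permission for
    traps = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in valid:
            continue  # permission check passes, no trap
        traps += 1  # permission check fails, trap to software
        valid.update(range(line, line + batch_lines))  # batched acquire
    return traps


# A sequential walk over 16 lines, touching each line 8 times.
stream = list(range(0, 16 * LINE_SIZE, 8))
print(count_traps(stream, batch_lines=1))  # 16 traps without batching
print(count_traps(stream, batch_lines=4))  # 4 traps with a batch of 4 lines
```

For this purely sequential stream, batching four lines per trap cuts the trap count by a factor of four. Real access patterns are less regular, so a fixed batch size can also fetch lines that are never used.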
Similar resources
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
Scalable directoryless shared memory coherence using execution migration
We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using an execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data repl...
Programming Research Group PRACTICAL BARRIER SYNCHRONISATION
We investigate the performance of barrier synchronisation on both shared-memory and distributed-memory architectures, using a wide range of techniques. The performance results obtained show that distributed-memory architectures behave predictably, although their performance for barrier synchronisation is relatively poor. For shared-memory architectures, a much larger range of implementation tec...
Practical barrier synchronisation
We investigate the performance of barrier synchronisation on both shared-memory and distributed-memory architectures, using a wide range of techniques. The performance results obtained show that distributed-memory architectures behave predictably, although their performance for barrier synchronisation is relatively poor. For shared-memory architectures, a much larger range of implementation te...
University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory A New Cache Protocol Based On The Order Free Consistency Memory Model
Computer architects are now studying a new generation of chip architectures that may integrate hundreds of processing cores and memory banks on a single chip with novel interconnect technologies. A key challenge lies in the design and development of an efficient on-chip shared memory organization for these future many-core architectures. New approaches need to be developed to address this chall...